Instructions

In this week’s lab, the main goal is to learn and practice how to wrangle a data set. On the due date, turn in your Rmd file and the html product.

Exercise 1

Open your project for this class. Make sure all your work is done relative to this project.

Open the lab3.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.

Exercise 2 (4pts)

This exercise is about the french fries example from the lecture notes. Make sure the code works for you, and answer these questions:

  1. Does the rating on potato’y change over time?

Some subjects rate the fries lower on potato'y over time, regardless of oil, but for most subjects there is no trend.

  2. Is there a difference in the grassy rating by oil type? Over time?

It looks like the grassy rating goes down, particularly in weeks 7-9, but it pops up again in week 10. It's hard to say there is a trend.

library(tidyverse)
data(french_fries, package = "reshape2")
french_fries <- as_tibble(french_fries)
# reshape to long form: one row per subject/time/treatment/rep/scale
ff_m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
# average the potato'y rating over the two replicates
ff_av <- ff_m %>% 
  filter(type == "potato") %>%
  group_by(subject, time, treatment) %>%
  summarise(rating = mean(rating))
ggplot(ff_av, aes(x=time, y=rating, colour=treatment, group=treatment)) + 
  geom_line() +
  facet_wrap(~subject) 

ff_grassy <- ff_m %>%
  filter(type == "grassy") 
ggplot(ff_grassy, aes(x=treatment, y=rating)) + 
  geom_boxplot() + scale_y_sqrt() +
  facet_wrap(~time)

Exercise 3 (10pts)

For the airlines data shown in the lecture notes, join the airport locations with the flight table, and answer these questions:
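A sketch of the join that produces the table below, assuming the flight records for tail number N4YRAA are in a table called N4YRAA and the airport locations are in a table called airports with columns AIRPORT, LATITUDE, LONGITUDE and DISPLAY_AIRPORT_NAME (both table and column names here are assumptions):

```r
library(dplyr)

# attach the origin airport's coordinates and name to each flight record;
# ORIGIN in the flight table matches AIRPORT in the locations table
N4YRAA_latlon <- N4YRAA %>%
  left_join(airports, by = c("ORIGIN" = "AIRPORT")) %>%
  rename(ORIGIN_LATITUDE = LATITUDE,
         ORIGIN_LONGITUDE = LONGITUDE)
```

A left join keeps every flight record even if an airport code has no match, which makes missing location data easy to spot.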

# A tibble: 146 x 4
   ORIGIN ORIGIN_LATITUDE ORIGIN_LONGITUDE
    <chr>           <dbl>            <dbl>
 1    CVG        39.04889        -84.66778
 2    DFW        32.89722        -97.03778
 3    DFW        32.89722        -97.03778
 4    STL        38.74861        -90.37000
 5    IND        39.71722        -86.29472
 6    CHS        32.89861        -80.04056
 7    DFW        32.89722        -97.03778
 8    DFW        32.89722        -97.03778
 9    MKE        42.94694        -87.89694
10    MKE        42.94694        -87.89694
# ... with 136 more rows, and 1 more variables: DISPLAY_AIRPORT_NAME <chr>

  1. (2pts) What is the plane’s flight path on May 6th? How often does plane N4YRAA fly through DFW on May 6th, 2017? twice
N4YRAA_latlon %>% 
  filter(FL_DATE == ymd("2017-05-06")) %>% 
  select(ORIGIN, DEST)
# A tibble: 3 x 2
  ORIGIN  DEST
   <chr> <chr>
1    DAY   DFW
2    DFW   DSM
3    DSM   DFW
  2. (2pts) What is the plane’s flight path on May 7th? How often does plane N4YRAA fly through DFW on May 7th, 2017? 3 times
N4YRAA_latlon %>% 
  filter(FL_DATE == ymd("2017-05-07")) %>% 
  select(ORIGIN, DEST)
# A tibble: 6 x 2
  ORIGIN  DEST
   <chr> <chr>
1    DFW   MFE
2    MFE   DFW
3    DFW   IAH
4    IAH   DFW
5    DFW   MSY
6    MSY   DFW
  3. (1pt) How many times does it fly through DFW in the month of May, 2017? 57 times
N4YRAA_latlon %>% count(ORIGIN) %>% filter(ORIGIN == "DFW")
# A tibble: 1 x 2
  ORIGIN     n
   <chr> <int>
1    DFW    57
  4. (2pts) Does the plane have the same flight path on each week day? E.g. is it the same for all Mondays? No; looking just at the Mondays, May 1 and May 8 have different routes.
N4YRAA_latlon %>% mutate(wday=wday(FL_DATE, label=TRUE)) %>%
  filter(wday == "Mon")
# A tibble: 20 x 15
      FL_DATE CARRIER FL_NUM ORIGIN  DEST DEP_TIME ARR_TIME DISTANCE
       <date>   <chr>  <int>  <chr> <chr>    <chr>    <chr>    <dbl>
 1 2017-05-01      AA   2663    DAY   DFW     0558     0739      861
 2 2017-05-01      AA    970    DFW   MCI     0852     1027      460
 3 2017-05-01      AA    970    MCI   DFW     1109     1256      460
 4 2017-05-01      AA   2358    DFW   MSY     1350     1507      447
 5 2017-05-01      AA   2358    MSY   DFW     1551     1720      447
 6 2017-05-01      AA   2338    DFW   CHS     1904     2225      987
 7 2017-05-08      AA    137    DFW   PIT     0657     1032     1067
 8 2017-05-08      AA    188    PIT   DFW     1119     1301     1067
 9 2017-05-08      AA   2458    DFW   PIT     1425     1754     1067
10 2017-05-08      AA   2458    PIT   DFW     1852     2038     1067
11 2017-05-08      AA   2176    DFW   OKC     2207     2301      175
12 2017-05-22      AA   1252    ORD   RDU     0732     1021      646
13 2017-05-22      AA   1252    RDU   ORD     1117     1221      646
14 2017-05-22      AA    312    ORD   MSP     1338     1456      334
15 2017-05-22      AA     46    MSP   ORD     1600     1734      334
16 2017-05-22      AA   1323    ORD   STL     1855     2017      258
17 2017-05-29      AA   1556    DFW   IAD     0659     1037     1172
18 2017-05-29      AA   1556    IAD   DFW     1133     1354     1172
19 2017-05-29      AA    386    DFW   JAX     1517     1830      918
20 2017-05-29      AA    386    JAX   DFW     1911     2042      918
# ... with 7 more variables: ORIGIN_LATITUDE <dbl>,
#   ORIGIN_LONGITUDE <dbl>, DISPLAY_AIRPORT_NAME.x <chr>,
#   DEST_LATITUDE <dbl>, DEST_LONGITUDE <dbl>,
#   DISPLAY_AIRPORT_NAME.y <chr>, wday <ord>
  5. (1pt) How many miles did the plane fly in May? 90,089 miles
N4YRAA_latlon %>% summarise(sum(DISTANCE))
# A tibble: 1 x 1
  `sum(DISTANCE)`
            <dbl>
1           90089
  6. (1pt) What is the average number of miles flown per day in May?
N4YRAA_latlon %>% summarise(mean(DISTANCE))
# A tibble: 1 x 1
  `mean(DISTANCE)`
             <dbl>
1         612.8503
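Note that mean(DISTANCE) above averages over individual flight legs rather than over days; an average of the daily totals could be sketched as follows, using the same N4YRAA_latlon table:

```r
library(dplyr)

# total miles flown per calendar day, then average those daily totals
N4YRAA_latlon %>%
  group_by(FL_DATE) %>%
  summarise(daily_miles = sum(DISTANCE)) %>%
  summarise(mean(daily_miles))
```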

Exercise 4 (11pts)

This question extends the analysis of the whaleshark data, which was pulled from https://www.whaleshark.org in 2013. It lists verified encounters with whale sharks across the globe.

whalesharks <- read_csv("whaleshark-encounters.csv")
  1. (1pt) Who is the most prolific sighter of sharks? (You can do this using the email address provided by the sighters. Extra point for finding the actual name of this person by web searching.) grampusr@hotmail.com; I think this is Rafael de la Parra, who is very active in whale shark conservation.
whalesharks %>% count(`Submitter Email Address`, sort=TRUE)
# A tibble: 2,749 x 2
    `Submitter Email Address`     n
                        <chr> <int>
 1       grampusr@hotmail.com  3370
 2        wsphotoid@yahoo.com  2236
 3     dani.rob@dec.wa.gov.au  1267
 4       darcy@whaleshark.org   968
 5        brad@whaleshark.org   914
 6  simon@marinemegafauna.org   802
 7     darcybradley@Gmail.com   575
 8  adove@georgiaaquarium.org   525
 9 sharkwatcharabia@gmail.com   434
10   whaleshark@dec.wa.gov.au   409
# ... with 2,739 more rows
  2. (2pts) What months are sharks most often spotted? Is this different if we only examine sightings from the southern hemisphere? Overall: April, May, June, July; in the southern hemisphere it is clearly April, May, June.
whalesharks %>% count(`Month Collected`)
# A tibble: 13 x 2
   `Month Collected`     n
               <int> <int>
 1                 1   853
 2                 2   911
 3                 3  1444
 4                 4  4017
 5                 5  3823
 6                 6  3161
 7                 7  2837
 8                 8  2305
 9                 9  1375
10                10  1030
11                11   765
12                12   791
13                NA    52
whalesharks %>% filter(Latitude < 0) %>% count(`Month Collected`)
# A tibble: 13 x 2
   `Month Collected`     n
               <int> <int>
 1                 1   291
 2                 2   299
 3                 3   347
 4                 4  2241
 5                 5  2686
 6                 6  1882
 7                 7   961
 8                 8   184
 9                 9   183
10                10   305
11                11   274
12                12   203
13                NA     6
  3. (1pt) In Australia, the most common location is off the West Australia coast, near Ningaloo reef. Filter the data based on this geographic region. Use this set for the remaining questions.
library(lubridate)
ningaloo <- whalesharks %>% filter(grepl("Ningaloo", Locality)) %>%
    mutate(date=ymd(paste(`Year Collected`, `Month Collected`, 
                          `Day Collected`, sep="-")))
  4. (2pts) Whaleshark A524 is spotted how many times? Make a trajectory of this whaleshark’s movements, in 2010, overlaid on a map of the area. 62 times
library(ggmap)
map <- get_map(location=c(lon=114.1, lat=-21.9), zoom=8)
ningaloo %>% count(`Marked Individual`, sort=TRUE)
# A tibble: 774 x 2
   `Marked Individual`     n
                 <chr> <int>
 1                <NA>   754
 2               A-524    62
 3               A-098    57
 4               A-001    55
 5               A-102    51
 6               A-534    51
 7               A-268    47
 8               A-425    44
 9               A-529    44
10               A-227    43
# ... with 764 more rows
A524 <- ningaloo %>% filter(`Marked Individual`=="A-524")
ggmap(map) + 
  geom_point(data=A524, aes(x=Longitude, y=Latitude), colour="red") +
  geom_line(data=A524, aes(x=Longitude, y=Latitude), colour="red")

  5. (2pts) Filter the whalesharks that are marked individuals, and plot the distribution of the number of sightings.
ningaloo_nomiss <- ningaloo %>% filter(!is.na(`Marked Individual`))
ningaloo_nomiss %>% count(`Marked Individual`, sort=TRUE) %>%
  ggplot(aes(x=n)) + geom_histogram(binwidth=5)

keep <- ningaloo_nomiss %>% 
  count(`Marked Individual`, sort=TRUE) %>% 
  filter(n>=40) 
ningaloo_freqwhales <- ningaloo %>% 
  filter(`Marked Individual` %in% keep$`Marked Individual`) %>%
  arrange(date)
  6. (3pts) Filter the sharks that have been sighted at least 40 times. Using the code provided, plot the tracks of these whales and explore the general habitat. Do some individuals have specific neighbourhoods? Do they hang in some neighbourhoods in some years, and in different locations in other years? Are there some months when they appear to be migrating? No, the individuals are pretty much all in a group, and they hang in the same neighbourhoods each year. There seems to be a bit more movement April-June.
library(ggthemes)   # provides theme_map()
ggmap(map) + geom_point(data=ningaloo_freqwhales, aes(x=Longitude, y=Latitude, 
                                colour=`Marked Individual`)) + 
                                colour=`Marked Individual`)) + 
  geom_line(data=ningaloo_freqwhales, 
            aes(x=Longitude, y=Latitude, 
                colour=`Marked Individual`, group=`Marked Individual`)) + 
  facet_wrap(~`Marked Individual`, ncol=5) + theme_map() +
  theme(legend.position="None")

ggmap(map) + geom_point(data=ningaloo_freqwhales, aes(x=Longitude, y=Latitude, 
                                colour=`Marked Individual`)) + 
  geom_line(data=ningaloo_freqwhales, 
            aes(x=Longitude, y=Latitude, 
                colour=`Marked Individual`, group=`Marked Individual`)) + 
  facet_wrap(~`Year Collected`, ncol=5) + theme_map() +
  theme(legend.position="None")

ggmap(map) + geom_point(data=ningaloo_freqwhales, 
                        aes(x=Longitude, y=Latitude, 
                                colour=`Marked Individual`)) + 
  geom_line(data=ningaloo_freqwhales, 
            aes(x=Longitude, y=Latitude, 
                colour=`Marked Individual`, group=`Marked Individual`)) + 
  facet_wrap(~`Month Collected`, ncol=4) + theme_map() +
  theme(legend.position="None")

Exercise 5 (9pts)

The file budapest.csv has a subset of web click-through data related to hotel searches for Budapest. Each line in this data corresponds to a summary of a person looking for a hotel on the Expedia web site. These questions are from last lab; do the wrangling and answer them now.

budapest <- read_csv("budapest.csv")
  1. (2pt) If I want to answer the question “What proportion of people searching actually booked a hotel room?” what would I need to do to the data? (The variable recording the searcher’s final decision is CLICK_THRU_TYP_ID, and the code indicating a booking is 3406.) 0.00102, very few
budapest %>% count(CLICK_THRU_TYP_ID) %>% mutate(p=n/sum(n))
# A tibble: 4 x 3
  CLICK_THRU_TYP_ID     n       p
              <int> <int>   <dbl>
1              3402   945 0.01890
2              3404   217 0.00434
3              3406    51 0.00102
4                NA 48787 0.97574
  2. (2pt) If I want to answer the question “What day of the week are most people searching for hotels?” what would I need to do with the data? (There are two kinds of date variables in the data: one for when they are searching, SRCH_DATETM, and the others for the dates they want the hotel room, SRCH_BEGIN_USE_DATE and SRCH_END_USE_DATE.) Oh, there are only three days on which searches are being done for Budapest! Most searches are on Thursday.
library(lubridate)
budapest %>% mutate(wday = wday(SRCH_DATETM, label=TRUE)) %>%
  count(wday)
# A tibble: 3 x 2
   wday     n
  <ord> <int>
1 Thurs 25441
2   Fri 18249
3   Sat  6310
  3. (2pt) If I want to answer the question “How far ahead of the check-in date do people typically search for a hotel room?” what needs to be done with the data? Looks like it is about two weeks (median 15 days). The dates are very messy, with strange differences like -40087 days and 4038 days.
date_diff <- budapest %>% 
  mutate(SRCH_BEGIN_USE_DATE = ymd(SRCH_BEGIN_USE_DATE),
         SRCH_DATETM = as.Date(ymd_hms(SRCH_DATETM))) %>%
  select(SRCH_BEGIN_USE_DATE, SRCH_DATETM) %>%
  mutate(dif = as.numeric(SRCH_BEGIN_USE_DATE - SRCH_DATETM)) 
summary(date_diff$dif)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -40087       1      15   -7910      44    4038 
  4. (1pt) There are a lot of missing values in the data, particularly for the booking variable. If an NA essentially means that the person searching quit the site without making a booking, how would you recode the missing values? Replace them with 0, indicating not booked.
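A minimal sketch of that recoding, using tidyr's replace_na():

```r
library(tidyr)

# treat a missing click-through code as "left without booking" (0)
budapest <- budapest %>%
  replace_na(list(CLICK_THRU_TYP_ID = 0))
```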

  5. (2pt) If I want to answer the question “Does the existence of a promotion, IS_PROMO_FLAG, tend to result in a higher likelihood of a booking?” what operations do you need to do on the data? No promotion: 0.00094; with promotion: 0.00114. A promotion appears to help a little: if there were 100,000 searches, a promotion would lead to about 20 more bookings.

budapest %>% replace_na(list(CLICK_THRU_TYP_ID=0)) %>%
  mutate(booked=ifelse(CLICK_THRU_TYP_ID == 3406, "Y", "N")) %>%
  group_by(IS_PROMO_FLAG) %>%
  count(booked)
# A tibble: 6 x 3
# Groups:   IS_PROMO_FLAG [3]
  IS_PROMO_FLAG booked     n
          <chr>  <chr> <int>
1             N      N 33934
2             N      Y    32
3             Y      N 14955
4             Y      Y    17
5          <NA>      N  1060
6          <NA>      Y     2

Exercise 6 (13pts)

This question is about the 2015 PISA results. (We used the 2012 data in lab 1.) The data is downloaded from http://www.oecd.org/pisa/data/2015database/. The SPSS-format “Student questionnaire data file (419MB)” is downloaded and processed using this code to extract results for Australia:

library(haven)
pisa_2015 <- read_sav(file.choose())
pisa_au <- pisa_2015 %>% filter(CNT == "AUS")
save(pisa_au, file="pisa_au.rda")

You don’t need to do this, because the Australia data is extracted and saved for you. Your task is to answer these questions about Australia. At times it may be helpful to examine the data dictionary, which is provided as an Excel file (you can also download this directly from the OECD PISA site).

A large amount of pre-processing is done on the data, as performed by this code:

load("pisa_au.rda")
pisa_au <- pisa_au %>% mutate(state=as.character(substr(STRATUM, 4, 5)),
                schtype_yr=as.character(substr(STRATUM, 6, 7))) %>%
  mutate(state=recode(state, "01"="ACT", "02"="NSW", "03"="VIC",
       "04"="QLD", "05"="SA", "06"="WA", "07"="TAS", "08"="NT")) %>%
  mutate(schtype_yr=recode(schtype_yr,
            "01"="Catholic_Y10", "02"="Catholic_noY10",
            "03"="Gov_Y10", "04"="Gov_noY10",
            "05"="Ind_Y10", "06"="Ind_noY10",
            "07"="Catholic_Y10", "08"="Catholic_noY10",
            "09"="Gov_Y10", "10"="Gov_noY10",
            "11"="Ind_Y10", "12"="Ind_noY10",
            "13"="Catholic_Y10", "14"="Catholic_noY10",
            "15"="Gov_Y10", "16"="Gov_noY10",
            "17"="Ind_Y10", "18"="Ind_noY10",
            "19"="Catholic_Y10", "20"="Catholic_noY10",
            "21"="Gov_Y10", "22"="Gov_noY10",
            "23"="Ind_Y10", "24"="Ind_noY10",
            "25"="Catholic_Y10", "26"="Catholic_noY10",
            "27"="Gov_Y10", "28"="Gov_noY10",
            "29"="Ind_Y10", "30"="Ind_noY10",
            "31"="Catholic_Y10", "32"="Catholic_noY10",
            "33"="Gov_Y10", "34"="Gov_noY10",
            "35"="Ind_Y10", "36"="Ind_noY10",
            "37"="Catholic_Y10", "38"="Catholic_noY10",
            "39"="Gov_Y10", "40"="Gov_noY10",
            "41"="Ind_Y10", "42"="Ind_noY10",
            "43"="Catholic_Y10", "44"="Catholic_noY10",
            "45"="Gov_Y10", "46"="Gov_noY10",
            "47"="Ind_Y10", "48"="Ind_noY10")) %>%
  separate(schtype_yr, c("schtype","yr")) %>%
  rename(birthmonth=ST003D02T, birthyr=ST003D03T,
         gender=ST004D01T, desk=ST011Q01TA,
         room=ST011Q02TA, computer=ST011Q04TA, internet=ST011Q06TA,
         solarpanels=ST011D17TA, tvs=ST012Q01TA, cars=ST012Q02TA,
         music_instr=ST012Q09NA, books=ST013Q01TA, birthcnt=ST019AQ01T,
         mother_birthcnt=ST019BQ01T, father_birthcnt=ST019CQ01T,
         test_anxiety=ST118Q01NA, ambitious=ST119Q04NA,
         prefer_team=ST082Q01NA, make_friends_easy=ST034Q02TA,
         tardy=ST062Q03TA, science_fun=ST094Q01NA, breakfast=ST076Q01NA,
         work_pay=ST078Q10NA, sport=ST078Q11NA, internet_use=IC006Q01TA,
         install_software=IC015Q02NA,
         outhours_study=OUTHOURS, math_time=MMINS, read_time=LMINS,
         science_time=SMINS, belong=BELONG,
         anxtest=ANXTEST, motivat=MOTIVAT, language=LANGN,
         home_edres=HEDRES, home_poss=HOMEPOS, wealth=WEALTH,
         stuweight=W_FSTUWT) %>%
    mutate(math=(PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH+
                     PV6MATH+PV7MATH+PV8MATH+PV9MATH+PV10MATH)/10,
           science=(PV1SCIE+PV2SCIE+PV3SCIE+PV4SCIE+PV5SCIE+
                        PV6SCIE+PV7SCIE+PV8SCIE+PV9SCIE+PV10SCIE)/10,
           read=(PV1READ+PV2READ+PV3READ+PV4READ+PV5READ+
                     PV6READ+PV7READ+PV8READ+PV9READ+PV10READ)/10) %>%
   select(state, schtype, yr, birthmonth, birthyr, gender, desk, room,
          computer, internet, solarpanels, tvs, cars, music_instr, books,
          birthcnt, mother_birthcnt, father_birthcnt, test_anxiety,
          ambitious, prefer_team, make_friends_easy, tardy, science_fun,
          breakfast, work_pay, sport, internet_use, install_software,
          outhours_study, math_time, read_time, science_time, belong,
          anxtest, motivat, language, home_edres, home_poss, wealth,
          stuweight, math, science, read) %>%
  mutate(gender=factor(gender, levels=1:2, labels=c("female", "male"))) %>% 
  mutate(birthmonth=factor(birthmonth, levels=1:12,
    labels=c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",
                            "sep", "oct", "nov", "dec")))
  1. (2pt) Explain how the STRATUM variable is processed to create three new variables: state, schtype and yr. The 4th and 5th characters code the state, and these are extracted into the state variable. The 6th and 7th characters code both school type and year tested; this is complicated to disentangle, so a new text string was created, which was then separated into the two variables.
  2. (2pt) Compute the average of math scores by state. Which state does best, on average, on math? (You should use the stuweight variable to compute a weighted average. This is survey data, and the weights indicate how representative that individual is of the population.) ACT
pisa_au %>% group_by(state) %>% summarise(m=weighted.mean(math, stuweight))
# A tibble: 8 x 2
  state        m
  <chr>    <dbl>
1   ACT 505.4857
2   NSW 494.2584
3    NT 478.0795
4   QLD 486.3467
5    SA 489.3091
6   TAS 468.9917
7   VIC 498.6085
8    WA 503.8274
  3. (2pt) Compute the difference in average male and female math scores by state. Which state has the smallest average gender difference? QLD
pisa_au %>% group_by(state, gender) %>% 
  summarise(m=weighted.mean(math, stuweight)) %>%
  spread(gender, m) %>%
  mutate(dif=female-male)
# A tibble: 8 x 4
# Groups:   state [8]
  state   female     male        dif
  <chr>    <dbl>    <dbl>      <dbl>
1   ACT 502.1651 508.7725  -6.607420
2   NSW 492.0413 496.4775  -4.436208
3    NT 463.6726 492.1481 -28.475479
4   QLD 486.8992 485.8188   1.080436
5    SA 485.8819 492.5379  -6.655965
6   TAS 466.9742 470.8269  -3.852642
7   VIC 491.9853 505.2567 -13.271372
8    WA 501.4735 506.1641  -4.690684
  4. (2pt) Does test anxiety have an effect on math score? (Use the variable anxtest, and a simple regression model to answer this question.) Yes, it appears to reduce scores by about 12 points. It’s a very weak model, though, because the R-squared is only about 2%.
fit <- lm(math~anxtest, weights=stuweight, data=pisa_au)
summary(fit)

Call:
lm(formula = math ~ anxtest, data = pisa_au, weights = stuweight)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-1645.99  -224.24   -29.83   194.74  1467.37 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 498.6578     0.7291  683.92   <2e-16 ***
anxtest     -11.7413     0.7349  -15.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 356.3 on 13896 degrees of freedom
  (632 observations deleted due to missingness)
Multiple R-squared:  0.01804,   Adjusted R-squared:  0.01797 
F-statistic: 255.3 on 1 and 13896 DF,  p-value: < 2.2e-16
  5. (1pt) Explain what the rename operation is doing. It replaces the original cryptic variable names with human-readable ones.
  6. (4pt) Come up with two more questions as a group, based on the data description. Do the wrangling to answer these questions.